
feat(backend/kernel): add use_kernel=True flag — route through the Rust kernel via PyO3#787

Open
vikrantpuppala wants to merge 9 commits into `main` from `feat/kernel-backend`

Conversation

Contributor

@vikrantpuppala vikrantpuppala commented May 14, 2026

Summary

Phase 2 of the PySQL × kernel integration plan (design doc). Adds a new opt-in use_kernel=True connection flag that routes through a new backend/kernel/ module delegating to the Rust kernel via the databricks_sql_kernel PyO3 extension (kernel PR #13).

This replaces the previous ADBC POC branches (backend/adbc/ and backend/adbc_dm/ on adbc-rust-backend-via-dm, which were never merged) with a clean port that uses the kernel's v0 Databricks-native API directly instead of layering through ADBC.

Flag semantics

| Flag | Routes to | Status |
| --- | --- | --- |
| `use_kernel=True` | new `KernelDatabricksClient` (Rust kernel via PyO3) | new, opt-in, in active development |
| `use_sea=True` | existing `SeaDatabricksClient` (pure-Python SEA) | unchanged |
| neither | `ThriftDatabricksClient` (Thrift) | unchanged |

use_kernel=True and use_sea=True are mutually exclusive — passing both raises ValueError. Existing use_sea=True callers are unaffected.
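The routing and mutual-exclusion check described above can be sketched as follows. This is a hypothetical helper for illustration (`pick_backend` is not a real connector function; the actual dispatch lives in `Session._create_backend` and instantiates client classes rather than returning names):

```python
def pick_backend(use_kernel: bool = False, use_sea: bool = False) -> str:
    """Illustrative sketch of the flag routing; returns the client class name."""
    if use_kernel and use_sea:
        # the two opt-in flags are mutually exclusive
        raise ValueError("use_kernel=True and use_sea=True are mutually exclusive")
    if use_kernel:
        return "KernelDatabricksClient"  # Rust kernel via PyO3
    if use_sea:
        return "SeaDatabricksClient"     # pure-Python SEA, unchanged
    return "ThriftDatabricksClient"      # default, unchanged
```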

What use_kernel=True does

```python
import databricks.sql

# PAT auth — works end-to-end today
with databricks.sql.connect(
    server_hostname=...,
    http_path=...,
    access_token="dapi...",
    use_kernel=True,
) as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT * FROM main.default.t LIMIT 100")
        rows = cur.fetchall()  # rows come from the kernel via PyO3
```

New module layout

| File | Purpose |
| --- | --- |
| `src/databricks/sql/backend/kernel/client.py` | `KernelDatabricksClient(DatabricksClient)`. Lazy-imports `databricks_sql_kernel` so a connector install without the kernel wheel doesn't fail at startup — only `use_kernel=True` surfaces the missing-extra `ImportError`. |
| `src/databricks/sql/backend/kernel/auth_bridge.py` | Translates the connector's `AuthProvider` to kernel `Session` auth kwargs. PAT (including `TokenFederationProvider`-wrapped PAT — every provider is wrapped, so a naive `isinstance(AccessTokenAuthProvider)` check has to look through the wrapper) routes through `auth_type='pat'`. Anything else raises `NotSupportedError` until the kernel exposes a full external-auth surface. |
| `src/databricks/sql/backend/kernel/result_set.py` | `KernelResultSet(ResultSet)`. Duck-typed over the kernel's `ExecutedStatement` (sync exec) and `ResultStream` (metadata + async `await_result`); both expose `arrow_schema` / `fetch_next_batch` / `close`. FIFO batch buffer for `fetchmany(n)` semantics, with O(1) buffered-row accounting via a running counter. |
| `src/databricks/sql/backend/kernel/type_mapping.py` | Arrow → PEP 249 description-string mapper. |
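The Arrow → PEP 249 mapping in `type_mapping.py` can be sketched like this. The table entries and helper name here are illustrative assumptions (keyed on Arrow type names as plain strings so the sketch runs without pyarrow); the real mapper operates on `pyarrow.DataType` objects:

```python
# Illustrative subset of an Arrow-type-name → DBAPI type-slug table.
_ARROW_TO_DBAPI = {
    "int32": "int",
    "int64": "bigint",
    "utf8": "string",
    "float64": "double",
    "bool": "boolean",
    "date32": "date",
}

def describe(column_name: str, arrow_type_name: str) -> tuple:
    """Build a PEP 249 description tuple:
    (name, type_code, display_size, internal_size, precision, scale, null_ok).
    Unknown Arrow types fall back to their string representation."""
    type_code = _ARROW_TO_DBAPI.get(arrow_type_name, str(arrow_type_name))
    return (column_name, type_code, None, None, None, None, None)
```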

Error mapping

KernelError.code → PEP 249 exception class, in a single table in client.py. Structured fields (sql_state, error_code, query_id, http_status, retryable, vendor_code) are copied onto the re-raised exception so callers can branch on err.code / err.sql_state directly. Live e2e verified: bad SQL on use_kernel=True surfaces as DatabaseError(code='SqlError', sql_state='42P01').
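The single-table mapping plus structured-field copying described above can be sketched as follows. Exception classes and table entries here are a minimal stand-in (the real table in `client.py` covers 14 codes and uses the connector's own exception hierarchy):

```python
# Minimal stand-ins for the connector's PEP 249 exception hierarchy.
class DatabaseError(Exception): pass
class OperationalError(DatabaseError): pass
class ProgrammingError(DatabaseError): pass

# Illustrative subset of the KernelError.code → exception-class table.
_CODE_TO_EXCEPTION = {
    "SqlError": DatabaseError,
    "InvalidState": ProgrammingError,
    "Network": OperationalError,
}

def reraise_kernel_error(code, sql_state=None, **fields):
    """Map a kernel error code to a PEP 249 exception and copy the kernel's
    structured fields onto it so callers can branch on err.code / err.sql_state
    without reaching through __cause__."""
    exc = _CODE_TO_EXCEPTION.get(code, DatabaseError)(f"{code}: {sql_state}")
    exc.code = code
    exc.sql_state = sql_state
    for key, value in fields.items():  # e.g. query_id, http_status, retryable
        setattr(exc, key, value)
    raise exc
```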

Packaging

Without the kernel wheel, use_kernel=True raises:

ImportError: use_kernel=True requires the databricks-sql-kernel package.
Install it with: pip install databricks-sql-kernel

Local dev: cd databricks-sql-kernel/pyo3 && maturin develop --release into the connector's venv. (The [kernel] extra is intentionally not declared in pyproject.toml yet — databricks-sql-kernel isn't on PyPI, and declaring an unpublished dep breaks poetry lock for every CI job. The extra will land once the wheel is on PyPI.)

⚠️ Known gaps — acknowledged follow-ups

| Gap | Surface | Why deferred |
| --- | --- | --- |
| Parameter binding | `execute_command(parameters=[...])` raises `NotSupportedError` | PyO3 `Statement.bind_param` lands in a small follow-up PR on the kernel repo |
| Statement-level `query_tags` | Raises `NotSupportedError` | Same follow-up — PyO3 needs to expose `statement_conf` |
| `get_columns(catalog_name=None)` | Raises `ProgrammingError` (kernel's `SHOW COLUMNS` cannot span catalogs) | Documented divergence from Thrift/SEA; called out in the `Cursor.columns()` docstring |
| OAuth / federation / external-auth | `NotSupportedError` from the auth bridge | Kernel-side enablement PR covers this — once it lands, OAuth/MSAL/federation start working through this backend automatically |
| Volume PUT/GET (staging) | Not supported | Kernel has no Volume API yet |
| Telemetry parity | Per-statement `execution_result` / `retry_count` / chunk-level latency under-report on `use_kernel=True` | Needs kernel-side hooks; will plumb once available |
| Connector kwargs (TLS / proxy / retry / pool) | Honored by the kernel's own HTTP stack rather than the connector's; user-supplied values via `ssl_options` / `http_headers` / `http_client` are accepted-and-ignored | Kernel manages its own HTTP stack today; will plumb a per-knob bridge as kernel surfaces them |

Code review feedback addressed in this revision

Multi-reviewer (architecture, security, ops, performance, test, maintainability, agent-compat, language, devil's advocate) review surfaced several issues; the highest-impact ones are addressed in commits 37fa5446 (mechanical) and 24e9a5c2 (substantive):

  • Renamed the flag from repurposing use_sea=True to a dedicated use_kernel=True; native SEA routing is unchanged. (Was the largest reviewer concern.)
  • Implemented client-side table_types filter in get_tables using the SEA backend's _filter_arrow_table helper, replacing the previous "log a warning and return unfiltered" behaviour.
  • Replaced O(M²) buffer accounting in KernelResultSet with a running counter (_buffered_count).
  • KernelResultSet.close() now drops the entry from backend._async_handles to avoid stale references.
  • threading.RLock around _async_handles mutations / reads for concurrent-cursor safety.
  • Synthetic metadata CommandIds use plain uuid.uuid4().hex (no metadata- prefix) so cursor.query_id stays parseable downstream.
  • Raw PAT cleared from KernelDatabricksClient._auth_kwargs after the kernel Session is constructed.
  • Bearer-token control-char sanitization in the auth bridge.
  • Removed dead _use_arrow_native_complex_types kwarg; tightened _reraise_kernel_error signature and dropped dead branch; collapsed three KernelResultSet(...) construction sites through one _make_result_set helper.
  • Cursor.columns() docstring now documents the catalog_name=None divergence on use_kernel=True.
  • Federation-wrapped PAT unit test now constructs a real TokenFederationProvider(http_client=Mock()) instead of bypassing __init__.

Test plan

  • Unit tests — 75 cases across the kernel suite:
    • tests/unit/test_kernel_client.py (new in this revision, 38 cases) — covers _CODE_TO_EXCEPTION (14 entries), _reraise_kernel_error attribute forwarding, _STATE_TO_COMMAND_STATE (6 entries), all no-open-session guards, open_session double-open, parameters / query_tags rejection, get_columns catalog-required, cancel_command / close_command tolerance, get_query_state sync-path SUCCEEDED and Failed-state re-raise, synthetic CommandId shape, close_session cleanup on partial close failures. Uses a fake databricks_sql_kernel module so the test runs with no Rust extension dependency.
    • tests/unit/test_kernel_auth_bridge.py — PAT, federation-wrapped PAT (now via real TokenFederationProvider(http_client=Mock())), non-PAT rejection paths.
    • tests/unit/test_kernel_type_mapping.py — Arrow type mapping per type, description-tuple shape, fallback to str() for unknowns.
    • tests/unit/test_kernel_result_set.py — buffer semantics, fetchmany slicing within batch + across batch boundaries, idempotent close, close() swallowing handle-close failures, empty stream.
  • Full pre-existing unit suite — passes; the one pre-existing failure (test_useragent_header — agent detection adds agent/claude-code in this env, fails on main too) is unrelated to this change.
  • Live e2e against dogfood with use_kernel=True (PAT): SELECT 1, SELECT * FROM range(10000), fetchmany pacing, fetchall_arrow, all four metadata calls (catalogs / schemas / tables / columns), session_configuration={'ANSI_MODE': 'false'} round-trips, bad SQL surfaces as DatabaseError with code='SqlError' and sql_state='42P01'.

This pull request and its description were written by Isaac.

Phase 2 of the PySQL × kernel integration plan
(databricks-sql-kernel/docs/designs/pysql-kernel-integration.md).
Wires `use_sea=True` to a new `backend/kernel/` module that
delegates to the Rust kernel via the `databricks_sql_kernel` PyO3
extension (kernel PR #13).

New module: `src/databricks/sql/backend/kernel/`

- `client.py` — `KernelDatabricksClient(DatabricksClient)`. Lazy-
  imports `databricks_sql_kernel` so a connector install without
  the kernel wheel doesn't `ImportError` at startup; only
  `use_sea=True` surfaces the missing-extra message. Implements
  open/close_session, sync + async execute_command (async_op=True
  goes through `Statement.submit()` and stashes the handle in a
  dict keyed on `CommandId`), cancel/close_command,
  get_query_state, get_execution_result, and the metadata calls
  (catalogs / schemas / tables / columns) via
  `Session.metadata().list_*`. Real server-issued session and
  statement IDs flow through (no synthetic UUIDs).
- `auth_bridge.py` — translate the connector's `AuthProvider`
  into kernel `Session` kwargs. PAT (including federation-wrapped
  PAT — `get_python_sql_connector_auth_provider` always wraps the
  base in `TokenFederationProvider`, so a naive isinstance check
  never matches) routes through `auth_type="pat"`. Everything
  else routes through `auth_type="external"` with a callback that
  delegates to `auth_provider.add_headers({})`. (External today
  is rejected by the kernel at `build_auth_provider`; the
  separate kernel-side enablement PR will flip it on.)
- `result_set.py` — `KernelResultSet(ResultSet)`. Duck-typed over
  `databricks_sql_kernel.ExecutedStatement` (sync execute) and
  `ResultStream` (metadata + async await_result) since both
  expose `arrow_schema()` / `fetch_next_batch()` /
  `fetch_all_arrow()` / `close()`. Same FIFO batch buffer the
  prior ADBC POC used, so `fetchmany(n)` for n smaller than the
  kernel's natural batch size doesn't re-fetch.
- `type_mapping.py` — Arrow → PEP 249 description-string mapper.
  Lifted from the prior ADBC POC; centralised here so future
  kernel-result wrappers reuse the same mapping.

Kernel errors → PEP 249 exceptions: `KernelError.code` is mapped
in a single table to `ProgrammingError` / `OperationalError` /
`DatabaseError`. The structured fields (`sql_state`,
`error_code`, `query_id`, …) are copied onto the re-raised
exception so callers can branch on them without reaching through
`__cause__`.

Routing: `Session._create_backend` flips the `use_sea=True` branch
to instantiate `KernelDatabricksClient` instead of the native
`SeaDatabricksClient`. The native `backend/sea/` module is left
in place (no users on `use_sea=True` after this PR; its long-
term fate is out of scope here).

Packaging: `[tool.poetry.extras] kernel = ["databricks-sql-kernel"]`.
`pip install 'databricks-sql-connector[kernel]'` pulls in the
kernel wheel; `use_sea=True` without the extra raises a pointed
ImportError telling the user how to install it.

Known gaps (acknowledged, will be follow-ups):
- Parameter binding (`execute_command(parameters=[...])`) raises
  NotSupportedError — PyO3 `Statement.bind_param` lands in a
  follow-up.
- Statement-level `query_tags` raises NotSupportedError.
- `get_tables(table_types=[...])` returns unfiltered rows (the
  native SEA backend's filter is keyed on `SeaResultSet`; needs
  a small port to operate on `KernelResultSet`).
- External-auth end-to-end blocked on the kernel-side
  `AuthConfig::External` enablement PR.
- Volume PUT/GET (staging operations): kernel has no Volume API.

Test plan:
- Unit: 37 new tests across
  `tests/unit/test_kernel_auth_bridge.py` (auth provider →
  kwargs mapping, including federation-wrapped PAT and the
  External trampoline call-counter check),
  `tests/unit/test_kernel_type_mapping.py` (Arrow type mapping
  + description shape), and
  `tests/unit/test_kernel_result_set.py` (buffer semantics,
  fetchmany across batch boundaries, idempotent close, close()
  swallowing handle-close failures). All pass.
- Full unit suite: 600 pre-existing tests still pass; one
  pre-existing failure (`test_useragent_header` — agent
  detection adds `agent/claude-code` in this env) was already
  failing on main, unrelated to this change.
- Live e2e against dogfood with `use_sea=True`: SELECT 1,
  `range(10000)`, `fetchmany` pacing, `fetchall_arrow`, all four
  metadata calls (returned 75 catalogs / 144 schemas in main /
  47 tables in `system.information_schema` / 15 columns),
  `session_configuration={'ANSI_MODE': 'false'}` round-trips,
  bad SQL surfaces as DatabaseError with `code='SqlError'`
  and `sql_state='42P01'` on the exception. All checks pass.

Co-authored-by: Isaac
Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>
The earlier auth_bridge routed OAuth/MSAL/federation through the
kernel's External token-provider trampoline (a Python callable
the kernel invoked per HTTP request). Removing that for now.

Why: routing OAuth into the kernel inherently requires per-request
token resolution to keep refresh working during a long-running
session. Two viable mechanisms (kernel-native OAuth, or the
External callback); both have costs (duplicate OAuth flows vs
GIL-per-request). Punting the decision until there's actual
demand on use_sea=True.

Today: the bridge accepts PAT (including TokenFederationProvider-
wrapped PAT, which is how `get_python_sql_connector_auth_provider`
always shapes it). Any non-PAT auth_provider raises a clear
NotSupportedError pointing the user at use_sea=False (Thrift).

This shrinks the auth_bridge to ~50 lines and means the kernel-
side External enablement PR is no longer on the connector's
critical path — there's no kernel-side prerequisite for shipping
use_sea=True for PAT users.
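The wrapper-aware PAT check described above can be sketched as follows. Class names mirror the PR text, but the `inner` attribute and helper name are illustrative assumptions, and `NotImplementedError` stands in for the connector's `NotSupportedError`:

```python
class AccessTokenAuthProvider:
    """Stand-in for the connector's PAT provider."""
    def __init__(self, token: str):
        self.token = token

class TokenFederationProvider:
    """Stand-in for the wrapper every provider is shaped into."""
    def __init__(self, inner):
        self.inner = inner

def to_kernel_auth_kwargs(provider) -> dict:
    # Look through the federation wrapper first — a naive
    # isinstance(provider, AccessTokenAuthProvider) check never matches.
    inner = getattr(provider, "inner", provider)
    if isinstance(inner, AccessTokenAuthProvider):
        return {"auth_type": "pat", "token": inner.token}
    raise NotImplementedError(
        "Only PAT auth is supported on the kernel backend today"
    )
```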

Unit tests updated:
- TokenFederationProvider-wrapped PAT still routes to PAT (kept).
- Generic OAuth provider raises NotSupportedError (new).
- ExternalAuthProvider raises NotSupportedError (new).
- Silent non-PAT provider raises NotSupportedError (new) —
  reject the type itself rather than trying to extract a token
  we already know we can't use.

Live e2e against dogfood with use_sea=True (PAT): all checks
still pass (SELECT 1, range(10000), fetchmany pacing, four
metadata calls, session_configuration round-trip, structured
DatabaseError on bad SQL).

Co-authored-by: Isaac
Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>
Moves the previously-ad-hoc /tmp/connector_smoke.py into the repo
as a real pytest module under tests/e2e/ — same convention as the
rest of the e2e suite. Uses the existing session-scoped
`connection_details` fixture from the top-level conftest so it
shares the credential surface with every other live test.

11 tests cover:
- connect() with use_sea=True opens a session.
- SELECT 1: rows + description shape (column name + dbapi type slug).
- SELECT * FROM range(10000): multi-batch drain.
- fetchmany() pacing across the buffer boundary.
- fetchall_arrow() returns a pyarrow Table.
- All four metadata methods (catalogs / schemas / tables / columns).
- session_configuration={'ANSI_MODE': 'false'} round-trips.
- Bad SQL surfaces as DatabaseError with `code='SqlError'` and
  `sql_state='42P01'` attached as exception attributes.

Module-level skips:
- `databricks_sql_kernel` not importable → whole module skipped via
  pytest.importorskip (the wheel hasn't been installed).
- Live creds missing → fixture-level skip with a pointed message.

Run: `pytest tests/e2e/test_kernel_backend.py -v`. All 11 pass
against dogfood in ~20s.

Co-authored-by: Isaac
Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>
@vikrantpuppala
Contributor Author

Two updates since the initial PR:

1. Dropped External auth → PAT-only on the kernel backend (25723627). auth_bridge.py now routes PAT (including TokenFederationProvider-wrapped PAT) through the kernel's PAT path; everything else raises NotSupportedError pointing the user at use_sea=False. The kernel-side PR is no longer on this PR's critical path. Why: routing OAuth into the kernel requires per-request token resolution to keep refresh working — two viable mechanisms (kernel-native OAuth, or the External callback), both with costs. Punting the decision until there's pressure.

2. Live e2e tests moved into the repo (6b308156). The previous ad-hoc /tmp/connector_smoke.py is now tests/e2e/test_kernel_backend.py — a proper pytest module using the existing connection_details fixture. 11 tests cover: connect, SELECT 1, range(10000), fetchmany pacing, fetchall_arrow, all four metadata methods, session_configuration round-trip, structured DatabaseError on bad SQL. All 11 pass against dogfood in ~20s. Module skips cleanly when databricks_sql_kernel isn't installed or creds aren't set.

The auth_bridge unit tests are updated: OAuth providers / ExternalAuthProvider now assert NotSupportedError. 39 unit tests pass.

CI is failing across all jobs at `poetry lock` time:

    Because databricks-sql-connector depends on databricks-sql-kernel
    (^0.1.0) which doesn't match any versions, version solving failed.

The kernel wheel isn't yet published to PyPI — we verified the name
is available via the Databricks proxy, but the package itself hasn't
been built and uploaded yet. Declaring it as a poetry dep (even an
optional one inside an extra) requires the version to be resolvable,
and `poetry lock` runs as the setup step for every CI job: unit
tests, linting, type checks, all of them.

Fix: drop the `databricks-sql-kernel` dep declaration and the
`[kernel]` extra from pyproject.toml until the wheel is on PyPI.
The lazy import in `backend/kernel/client.py` still raises a
clear ImportError pointing at `pip install databricks-sql-kernel`
(or local maturin) when use_sea=True is invoked without the kernel
present.

When the kernel is published, a small follow-up will add back:

    databricks-sql-kernel = {version = "^0.1.0", optional = true}
    [tool.poetry.extras]
    kernel = ["databricks-sql-kernel"]

A pointed comment in pyproject.toml documents the deferred change.

Co-authored-by: Isaac
Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>
Three CI failures after the poetry-lock fix uncovered three real
issues:

1. pyarrow is optional in the connector. The default-deps CI test
   job installs without it; the +PyArrow job installs with. The
   kernel backend's result_set.py + type_mapping.py import pyarrow
   eagerly (the kernel always returns pyarrow), and the unit tests
   import the backend at collection time — which crashes the
   default-deps job at ModuleNotFoundError.

   Fix: gate the three kernel unit tests on `pytest.importorskip(
   "pyarrow")` so they skip on default-deps and run on +PyArrow.
   Verified locally: 39 pass with pyarrow, 3 skipped without.
   No change to the backend module itself — nothing imports it
   until use_sea=True is invoked, and pyarrow is on the kernel
   wheel's runtime dep list so use_sea=True can't hit this either.

2. mypy: KernelDatabricksClient.open_session returns
   self._session_id, which mypy types as Optional[SessionId]
   because the field starts as None. Fix: bind the new id to a
   local non-Optional variable, assign to the field, return the
   local. CI's check-types runs cleanly on backend/kernel/ now;
   pre-existing mypy noise elsewhere isn't mine.

3. black --check: black 22.12.0 (the version CI pins) wants
   reformatting on result_set.py / type_mapping.py / client.py.
   Applied. Verified locally with the same black version.

All 39 kernel unit tests + 619 pre-existing unit tests pass.

Co-authored-by: Isaac
Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>
The +PyArrow CI matrix installs pyarrow but not the
databricks-sql-kernel wheel (the wheel isn't on PyPI yet, and the
[kernel] extra is deferred — see commit 31ca581). The previous
fix gated unit tests on `pytest.importorskip("pyarrow")` but
test_kernel_auth_bridge.py was still pulled into a kernel-wheel
ImportError because:

  src/databricks/sql/backend/kernel/__init__.py
    -> from databricks.sql.backend.kernel.client import KernelDatabricksClient
        -> import databricks_sql_kernel  # ImportError on +PyArrow CI

The eager re-export from `__init__.py` was a convenience that
broke every consumer that only needed a submodule (type_mapping,
result_set, auth_bridge) — they all triggered the kernel wheel
import for no reason.

Fix:
- Drop the eager re-export from `kernel/__init__.py`. Comment
  documents why and points callers (= session.py::_create_backend,
  already this shape) at the direct `from .client import ...`.
- Drop the no-longer-needed `pytest.importorskip("pyarrow")` /
  `importorskip("databricks_sql_kernel")` from
  test_kernel_auth_bridge.py — auth_bridge.py itself has neither
  dep, so the test now runs on every CI matrix variant.
- test_kernel_result_set.py and test_kernel_type_mapping.py keep
  the pyarrow importorskip because they themselves use pyarrow.

Verified locally across the three matrix shapes:
- both pyarrow + kernel installed: 39 pass.
- pyarrow only (no kernel wheel — the +PyArrow CI shape): 39 pass.
- neither: 9 pass (auth_bridge only), 2 modules skip (the others
  use pyarrow).

Co-authored-by: Isaac
Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>

The connector's coverage CI job runs the full e2e suite, several of
whose test classes parametrize ``extra_params`` over ``{}`` and
``{"use_sea": True}``. With ``use_sea=True`` now routing through
the Rust kernel via PyO3, those cases die at ``connect()`` with our
pointed ImportError because the ``databricks-sql-kernel`` wheel
isn't yet on PyPI — and that CI job (sensibly) doesn't try to
build it from a sibling repo.

Fix: ``pytest_collection_modifyitems`` hook in the top-level
``conftest.py`` that adds a ``skip`` marker to any parametrize case
with ``extra_params={"use_sea": True, ...}`` when
``importlib.util.find_spec("databricks_sql_kernel")`` returns
``None``. Behavior change is CI-only — local dev with the kernel
wheel installed (via ``maturin develop`` from the kernel repo)
runs those cases as before.

Once the kernel wheel is published, the [kernel] extra in
pyproject.toml gets enabled (see comment block there) and the
default-deps CI matrix will install it; the skip then becomes a
no-op.
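The skip decision described above can be sketched without the pytest plumbing (`should_skip` is a hypothetical helper; the real hook wraps this logic in `pytest_collection_modifyitems` and applies `pytest.mark.skip` via `item.add_marker`):

```python
import importlib.util

def should_skip(extra_params: dict) -> bool:
    """Skip a parametrize case iff it routes through use_sea=True and the
    kernel wheel is not importable in this environment."""
    if importlib.util.find_spec("databricks_sql_kernel") is not None:
        return False  # kernel wheel installed: run the case live
    return bool(extra_params.get("use_sea"))
```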

Co-authored-by: Isaac
Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>
@vikrantpuppala
Contributor Author

CI status, final: 33 / 34 checks pass.

The one failing check (test-with-coverage, the e2e + coverage job) reports 7 test failures — but all 7 are pre-existing on main, unrelated to this PR. Verified by reproducing them on an independent recent PR (PR for thrift-result-set-heartbeat, run 25725422361, 2 days ago) and confirming the same 6 TestMstMetadata cases fail there too with identical error messages.

Failure breakdown:

| Test | Failure | Root cause |
| --- | --- | --- |
| `TestMstMetadata::test_cursor_{columns,schemas,tables,catalogs}_in_mst` (4) | `DatabaseError: GetCatalogs/Schemas/Tables/Columns is not supported within a multi-statement transaction` | Server-side change between Apr 28 (last green main run) and today; affects every PR's e2e job |
| `TestMstMetadata::test_cursor_{columns,tables}_non_transactional_after_concurrent_*` (2) | Same server-side MST limitation | Same as above |
| `TestPySQLLargeWideResultSet::test_query_with_large_wide_result_set[True-extra_params0]` (1) | `AttributeError: 'TestPySQLLargeWideResultSet' object has no attribute 'assertEqual'` | Latent bug from #772 ("Optimize CI"): split `LargeQueriesMixin` test classes off `unittest.TestCase`, but `fetch_rows` still calls `test_case.assertEqual`. Triggers only when the fetch completes inside the 5-min budget. |

My PR contributions to this job (all working as intended):

  • 53 tests now skip cleanly (vs failing). All the extra_params={"use_sea": True} cases — see the conftest hook in 8958e76e. Once databricks-sql-kernel ships to PyPI and the [kernel] extra is wired, these run as live tests.
  • 927 tests pass.

I don't believe my PR should be responsible for fixing the 7 pre-existing failures — they need their own fixes (server-side investigation for MST metadata; switching fetch_rows to pytest assertions for the large-result test). Happy to file follow-up issues if you want.

Comment thread src/databricks/sql/session.py Outdated
logger.debug("Creating Thrift backend client")
databricks_client_class = ThriftDatabricksClient
# `use_sea=True` now routes through the Rust kernel via
# PyO3. The native pure-Python SEA backend
Contributor Author


High severity — multi-reviewer consensus (architecture, agent-compat, devils-advocate, ops, test)

use_sea=True previously routed to SeaDatabricksClient (the pure-Python SEA REST backend). After this PR, the same kwarg routes to a Rust PyO3 wheel that isn't on PyPI yet. Side effects on existing use_sea=True users:

  • ImportError on upgrade unless they separately install databricks-sql-kernel
  • OAuth / federation / external auth → NotSupportedError (auth_bridge.py)
  • Parameter binding → NotSupportedError (client.py:476-484)
  • query_tags → NotSupportedError
  • Volume PUT/GET → unsupported
  • Telemetry mis-reports kernel sessions as DatabricksClientType.SEA
  • The native SEA backend (backend/sea/, ~700 LOC) is now zombie code: still in the tree, no longer reachable through any documented entry point.

The module docstring at backend/kernel/__init__.py even says the module's identity is deliberately decoupled from SEA REST (the kernel may switch transport SEA REST → SEA gRPC → …). That contradicts the flag name.

Recommend: introduce use_kernel=True as a new explicit flag. Leave use_sea=True routing to SeaDatabricksClient for now. Deprecate use_sea on a published timeline once the kernel reaches feature parity. Bundles cleanly with the related issues: docstring update at client.py:117 ("Use the SEA backend instead of the Thrift backend") is now factually wrong; CHANGELOG.md has no entry for this behavior change; no version bump; telemetry still reports DatabricksClientType.SEA for kernel sessions.

Contributor Author


Addressed in 24e9a5c — introduced a dedicated use_kernel=True flag. use_sea=True once again routes to the native pure-Python SeaDatabricksClient (unchanged); the new flag is opt-in and mutually exclusive. PR title + description updated to match.


logger.debug("Creating kernel-backed client for use_sea=True")
return KernelDatabricksClient(
server_hostname=server_hostname,
Contributor Author


High severity — flagged by ops, devils-advocate, architecture, maintainability, language

The kernel branch hardcodes 8 kwargs into KernelDatabricksClient(...). The Thrift branch (below) splats **kwargs. Silently dropped on use_sea=True:

  • _socket_timeout
  • All _retry_* knobs (_retry_stop_after_attempts_count, _retry_delay_*, …)
  • _tls_no_verify, _tls_verify_hostname, _tls_trusted_ca_file
  • pool_maxsize
  • use_cloud_fetch, use_hybrid_disposition, enable_query_result_lz4_compression
  • staging_allowed_local_path
  • query_tags
  • User-agent extras
  • _use_arrow_native_complex_types

KernelDatabricksClient.__init__ accepts **kwargs and never references them — accept-and-ignore with zero log line.

The most dangerous case: a user setting _tls_no_verify=True on Thrift (for an on-prem proxy / self-signed cert) gets that honored. On use_sea=True it silently no-ops and the kernel's own TLS stack will verify the cert. The operator believes verification is disabled when it isn't. Same for custom CA bundle (_tls_trusted_ca_file).

Worse: DriverConnectionParameters in telemetry (client.py:396-398) continues reporting socket_timeout=kwargs.get("_socket_timeout", None) — so dashboards lie about what's actually applied.

Recommend: at minimum, log a single WARNING at session-open enumerating which kwargs the kernel backend cannot honor (one-shot per process). Long-term, plumb retry/timeout/proxy/TLS through to the kernel, or refuse to start when unsupported knobs are set.

Contributor Author


Deferred — called out as a known gap in the updated PR description. The kernel manages its own HTTP stack today (TLS, retry, timeout, pool); we'll plumb a per-knob bridge as those surfaces appear kernel-side. Switching to use_kernel=True (24e9a5c) means existing use_sea=True callers — who relied on ssl_options / http_headers / retry knobs being honored on the SEA backend — are unaffected.

logger.warning("Error closing kernel handle: %s", exc)
self._buffer.clear()
self._kernel_handle = None
self._exhausted = True
Contributor Author


High severity — flagged by architecture, ops, performance

The base ResultSet.close() (src/databricks/sql/result_set.py:166-190) calls self.backend.close_command(self.command_id) to free server-side state and to drop client-side tracking. KernelResultSet.close() overrides the base entirely and calls only self._kernel_handle.close().

For the sync execute path this is OK (the executed handle owns the server statement; closing it releases server state). For the async path (execute_command(async_op=True)), the handle is also tracked in KernelDatabricksClient._async_handles. Result:

  1. User calls cursor.execute(..., async_op=True) → handle stored in _async_handles.
  2. User calls cursor.fetchall() → result set built.
  3. User calls cursor.close() → KernelResultSet.close() calls _kernel_handle.close() directly.
  4. _async_handles[command_id.guid] still holds the now-closed handle.
  5. Later, close_session() iterates _async_handles.values() and calls .close() again on the dead handle.

The kernel's close() is idempotent per the PR's docstring, so this isn't a crash — but the bookkeeping is inconsistent and leaks an entry for every async-submitted statement that closes through the result-set path. Long-lived connections accumulate dead entries.

Recommend: KernelResultSet.close() should either (a) call super().close() (which calls backend.close_command), or (b) explicitly self.backend._async_handles.pop(self.command_id.guid, None) after closing the handle.

Contributor Author


Fixed in 24e9a5c. KernelResultSet.close() now pops the entry from backend._async_handles (under the new _async_handles_lock). No-op for sync-execute and metadata paths, which never register there.

self.has_more_rows = False
self.status = CommandState.SUCCEEDED
return False
if batch.num_rows > 0:
Contributor Author


High severity — performance, empirically verified

```python
def _buffered_rows(self) -> int:
    if not self._buffer:
        return 0
    first = self._buffer[0].num_rows - self._buffer_offset
    rest = sum(b.num_rows for b in list(self._buffer)[1:])  # allocates + O(M)
    return first + rest
```

Two issues:

  1. list(self._buffer)[1:] allocates a fresh list on every call — gratuitous. Use itertools.islice(self._buffer, 1, None) or iterate the deque.
  2. _ensure_buffered calls _buffered_rows() in a loop, once per pulled batch → O(M²) in batch count.

Empirically verified with a synthetic harness:

  • 5,000 single-row batches → ~447ms just in the row-count loop (Python-side, no pyarrow).

Hot path: fetchmany(small_N) / fetchone() repeatedly when the kernel returns many small batches, or fetchall() over a deep stream.

Fix (~10 lines): track self._buffered_count: int as a running counter — += batch.num_rows in _pull_one_batch, -= take in _take_buffered, recompute on _drain. _buffered_rows() becomes O(1); _ensure_buffered becomes O(M).
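The running-counter fix can be sketched with plain ints standing in for Arrow batches (a sketch using this thread's names, not the PR's actual code):

```python
from collections import deque

class BufferSketch:
    """Running-count buffer bookkeeping from the recommended fix.
    Each deque entry is just a num_rows int; real code holds Arrow batches."""
    def __init__(self):
        self._buffer = deque()       # num_rows per buffered batch
        self._buffer_offset = 0      # rows already consumed from buffer[0]
        self._buffered_count = 0     # running total, replaces the O(M) recount

    def _pull_one_batch(self, num_rows):
        self._buffer.append(num_rows)
        self._buffered_count += num_rows

    def _take_buffered(self, n):
        """Consume up to n rows from the front of the buffer."""
        take = min(n, self._buffered_count)
        remaining = take
        while remaining:
            avail = self._buffer[0] - self._buffer_offset
            step = min(avail, remaining)
            self._buffer_offset += step
            remaining -= step
            if self._buffer_offset == self._buffer[0]:
                self._buffer.popleft()
                self._buffer_offset = 0
        self._buffered_count -= take
        return take

    def _buffered_rows(self):
        return self._buffered_count  # O(1): no list() copy, no per-batch sum

q = BufferSketch()
for _ in range(5):
    q._pull_one_batch(1)             # five single-row batches
assert q._buffered_rows() == 5
assert q._take_buffered(3) == 3
assert q._buffered_rows() == 2
```

Because _buffered_rows() is now a field read, the _ensure_buffered loop does constant work per pulled batch, i.e. O(M) overall instead of O(M²).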


Fixed in 37fa544. Replaced _buffered_rows with a running counter _buffered_count maintained by _pull_one_batch / _take_buffered / _drain. _buffered_rows is now O(1); _ensure_buffered is O(M) in batch count instead of O(M²).

Comment thread conftest.py

@pytest.fixture(scope="session")
def host():
return os.getenv("DATABRICKS_SERVER_HOSTNAME")

High severity — flagged by test, ops

The new pytest_collection_modifyitems skips every extra_params={"use_sea": True} case when the kernel wheel isn't importable. CI installs --all-extras (.github/workflows/code-coverage.yml:39), but pyproject explicitly does NOT declare a [kernel] extra (per the PR's own comment at pyproject.toml:55-68). Result: every one of those cases is reported SKIPPED on every CI run.

Verified via grep against tests/e2e/test_driver.py — there are ~14 distinct @pytest.mark.parametrize("extra_params", [{}, {"use_sea": True}]) functions (test_query_with_large_wide_result_set, test_long_running_query, test_execute_async__*, test_unicode, test_fetchone, test_fetchall, test_fetchmany_*, test_iterator_api, test_multi_timestamps_arrow, …) plus 4 SEA-parametrized retry tests in tests/e2e/common/retry_test_mixins.py.

Before this PR they ran against SeaDatabricksClient. After this PR they vanish from the CI matrix entirely. This is a silent capability loss across the whole e2e suite, not just within the new code. Combined with F9 (no unit tests for the 511-LOC client.py), coverage of the kernel backend in CI may also fail the 85% threshold gate at .github/workflows/code-coverage.yml:80-81.

Recommend: (a) install the kernel wheel in CI before running e2e, OR (b) document this regression in the PR description so reviewers know use_sea=True e2e coverage in CI is currently 0, AND (c) file a follow-up to land kernel-wheel CI coverage.


Addressed in 24e9a5c — removed the conftest collection hook entirely. With use_sea=True back on the native SEA backend, the existing extra_params=[{}, {"use_sea": True}] parametrized cases run as they did before this PR (no skip needed). The kernel backend is now opt-in via use_kernel=True and doesn't intercept existing e2e parametrizations.

@@ -0,0 +1,511 @@
"""``DatabricksClient`` backed by the Rust kernel via PyO3.

High severity — flagged by test, devils-advocate

auth_bridge, result_set, and type_mapping each get unit coverage; the largest, most behavior-rich file in the PR has none. Only the e2e file exercises it — and that file skips silently when creds OR the kernel wheel are missing (which is the CI default, see F8).

Concretely uncovered by any non-live test:

  • _CODE_TO_EXCEPTION mapping (14 entries) — trivially testable with a fake _kernel.KernelError; a typo on "Unauthenticated" collapses silently to DatabaseError.
  • _reraise_kernel_error attribute forwarding — copies 7 structured fields (code, sql_state, error_code, vendor_code, http_status, retryable, query_id). E2E only verifies code + sql_state.
  • open_session double-open guardraise InterfaceError(...) not exercised.
  • All 5 InterfaceError "no open session" guardsexecute_command, get_catalogs, get_schemas, get_tables, get_columns.
  • get_columns catalog_name required check — e2e always passes a catalog.
  • execute_command parameters / query_tags NotSupportedError — regression risk that accepting parameters would silently dispatch to a kernel Statement with no bind_param.
  • cancel_command / close_command no-handle tolerant path.
  • close_session cleanup of _async_handles including swallow-on-KernelError.
  • get_query_state sync-path SUCCEEDED shortcut and Failed-state re-raise.
  • _STATE_TO_COMMAND_STATE mapping (6 entries).
  • max_download_threads property.
  • get_tables table_types warning behavior.

All achievable with a MagicMock _kernel module via monkeypatch.setattr. The auth-bridge and result-set tests demonstrate the pattern. The absence of tests/unit/test_kernel_client.py is the single biggest test-quality issue in this PR.

Recommend: add tests/unit/test_kernel_client.py — ~150 LOC covers the items above.
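The fake-module pattern suggested here can be sketched as follows; the databricks_sql_kernel surface (a KernelError carrying code / sql_state) is assumed from this thread, not taken from the real extension:

```python
import sys
import types

# Build a stand-in databricks_sql_kernel module so unit tests can import it
# without the Rust extension installed. Attribute names are illustrative.
fake = types.ModuleType("databricks_sql_kernel")

class KernelError(Exception):
    """Minimal fake of the kernel's structured error type."""
    def __init__(self, message, code="Unknown", sql_state=None):
        super().__init__(message)
        self.code = code
        self.sql_state = sql_state

fake.KernelError = KernelError
sys.modules["databricks_sql_kernel"] = fake

# Any later `import databricks_sql_kernel` now resolves to the fake, so
# code under test (and its lazy import) works on machines without the wheel.
import databricks_sql_kernel
try:
    raise databricks_sql_kernel.KernelError("denied", code="Unauthenticated")
except databricks_sql_kernel.KernelError as exc:
    assert exc.code == "Unauthenticated"
```

In pytest this would typically live in a fixture using monkeypatch.setitem(sys.modules, ...) so the fake is removed after each test.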


Addressed in 24e9a5c. Added tests/unit/test_kernel_client.py with 38 cases covering: full _CODE_TO_EXCEPTION (14-entry parametrize), _reraise_kernel_error attribute forwarding, full _STATE_TO_COMMAND_STATE (6-entry parametrize), all 5 no-open-session guards, open_session double-open, parameters and query_tags rejection, get_columns catalog-required, cancel_command / close_command tolerance, get_query_state sync-path SUCCEEDED and Failed-state re-raise, synthetic CommandId UUID shape, and close_session cleanup-on-failure. Uses a fake databricks_sql_kernel module installed into sys.modules so it runs without the Rust extension. 77/77 kernel unit tests pass locally.

federated.add_headers = base.add_headers
kwargs = kernel_auth_kwargs(federated)
assert kwargs == {"auth_type": "pat", "access_token": "dapi-abc"}


Medium severity — flagged by language, maintainability, devils-advocate, test, security

federated = TokenFederationProvider.__new__(TokenFederationProvider)
federated.external_provider = base
federated.add_headers = base.add_headers

This bypasses __init__ (which requires http_client and normalizes hostname) AND monkey-patches over the real TokenFederationProvider.add_headers — the one containing all the token-exchange logic. The assertion kwargs == {"auth_type": "pat", "access_token": "dapi-abc"} therefore says nothing about whether the bridge handles the real federation flow. It passes "by accident" because a dapi-… token isn't a JWT, so the real path's _should_exchange_token happens to return False.

Any future change to TokenFederationProvider.add_headers (added telemetry, new refresh trigger, eager exchange) silently breaks the bridge while this test stays green. With a real-init federated provider, add_headers writes the federation-exchanged token (not the original PAT) — the bridge would extract that token, not the underlying PAT. The test name test_federation_wrapped_pat_routes_to_kernel_pat overstates what's verified.

tests/unit/test_token_federation.py:31-36 already demonstrates clean construction with a MagicMock http_client.

Recommend:

federated = TokenFederationProvider(
    hostname="https://example.cloud.databricks.com",
    external_provider=base,
    http_client=Mock(),
)


Fixed in 37fa544. The test now constructs a real TokenFederationProvider(http_client=Mock()) and exercises its actual add_headers path; for a plain dapi-… PAT _should_exchange_token returns False (not a JWT) so no exchange fires and the mock http_client is never invoked.

def get_catalogs(
self,
session_id: SessionId,
max_rows: int,

Medium severity — flagged by architecture, devils-advocate, ops

_async_handles is a single dict per KernelDatabricksClient. Multiple cursors on the same connection share it via self.backend. Mutations happen in execute_command (insert), close_command (pop), cancel_command (get), close_session (iterate-then-clear) — all unlocked.

Two threads issuing async statements concurrently are safe in CPython by GIL accident but not by design. Worse, close_session does for handle in list(self._async_handles.values()): ...; self._async_handles.clear() — a thread mid-execute_command(async_op=True) could add a new handle after the iterator copy is taken but before clear(). That handle is dropped on the floor with no .close() called — kernel-side state leaks.

The connector explicitly documents thread-safety per cursor; this regresses below that bar for shared async tracking.

Recommend: wrap mutations in a threading.RLock, or document non-thread-safety in the class docstring with an explicit warning.


Fixed in 24e9a5c. Added self._async_handles_lock = threading.RLock() and wrapped every read/mutation site (execute_command insert, cancel_command / close_command / get_query_state / get_execution_result reads, close_session iterate+clear). The close_session pattern is now snapshot-under-lock then close-outside-lock — newly-added handles after the snapshot stay in the dict for the next sweep instead of being dropped on the floor.
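The snapshot-under-lock / close-outside-lock sweep described in this fix can be sketched as (illustrative names, not the PR's actual code):

```python
import threading

class AsyncHandleTracker:
    """Sketch of the fixed close_session sweep: snapshot the handle map
    under the lock, close handles outside it, then remove only the ones
    actually closed. Handles registered mid-sweep survive for the next
    sweep instead of being dropped without a close()."""
    def __init__(self):
        self._async_handles = {}
        self._async_handles_lock = threading.RLock()

    def register(self, guid, handle):
        with self._async_handles_lock:
            self._async_handles[guid] = handle

    def sweep(self):
        with self._async_handles_lock:
            snapshot = list(self._async_handles.items())
        closed = []
        for guid, handle in snapshot:
            handle.close()          # potentially slow; lock not held here
            closed.append(guid)
        with self._async_handles_lock:
            for guid in closed:
                self._async_handles.pop(guid, None)
        return closed

class _Handle:
    def __init__(self):
        self.closed = False
    def close(self):
        self.closed = True

t = AsyncHandleTracker()
t.register("a", _Handle())
t.register("b", _Handle())
closed = t.sweep()
assert sorted(closed) == ["a", "b"]
t.register("c", _Handle())          # late arrival: kept for the next sweep
assert "c" in t._async_handles
```

Popping only the closed guids (rather than a blanket clear()) is what prevents the leak: a handle inserted between snapshot and clear is never silently discarded.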

None,
)
for field in schema
]

Medium severity — flagged by agent-compat, maintainability

return str(arrow_type) for unrecognized types produces strings like "fixed_size_binary[16]" or "timestamp[ns, tz=UTC]" in cursor.description[i][1]. Code (and LLM agents) branching on description type strings — a common pattern, e.g., if col_type == "timestamp": — silently miss these cases.

The unit test only verifies pa.null() falls through to "null"; the visually ugly cases aren't covered.

pa.timestamp("us") and pa.timestamp("ns", tz="UTC") both pass is_timestamp and map to "timestamp" (fine), but other parametrized types (fixed-size, dictionary-encoded, union) fall to str(arrow_type).

Recommend: either (a) document the fallback shape in the module docstring so callers know to handle parameterized type strings, (b) lowercase + strip parameters before returning, or (c) extend the explicit list to cover the parametrized variants the kernel actually returns.
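Option (b) could look like the following hypothetical helper (not code from the PR; note that pyarrow renders some parameterized types with square brackets and others with parentheses):

```python
import re

def normalize_type_name(arrow_type_str: str) -> str:
    """Hypothetical option-(b) normalizer: lowercase and strip the trailing
    parameter list from an Arrow type's str() form. pyarrow uses square
    brackets for some types ("timestamp[ns, tz=UTC]") and parentheses for
    others ("decimal128(10, 2)"), so strip either."""
    return re.sub(r"(\[.*\]|\(.*\))$", "", arrow_type_str).lower()

assert normalize_type_name("timestamp[ns, tz=UTC]") == "timestamp"
assert normalize_type_name("fixed_size_binary[16]") == "fixed_size_binary"
assert normalize_type_name("decimal128(10, 2)") == "decimal128"
assert normalize_type_name("null") == "null"
```

This is deliberately lossy — the tz and precision are dropped — which is the trade-off the thread weighs against extending the explicit type table instead.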


Deferred. Documented as a follow-up — the kernel will eventually return a richer Arrow type surface, and the right shape is to expand the explicit table when kernel adds parameterized types we care about, not to lossily lowercase/strip on the connector side.

return
try:
handle.close()
except _kernel.KernelError as exc:

Medium severity — flagged by language

Signature is (exc: BaseException) -> "Error" with # type: ignore[return-value] on the non-KernelError passthrough. The ignore is masking a real type issue: this function's contract is "either re-raise as Error or pass through."

Every caller in the file does raise _reraise_kernel_error(exc) after an isinstance(exc, KernelError) check (12 callers), so the non-KernelError passthrough is unreachable in practice.

Recommend: delete the dead branch and tighten the signature to (exc: _kernel.KernelError) -> Error (called only after an isinstance check). Or, if the passthrough is intentional, change return-type to Union[BaseException, Error] and drop the ignore. Deleting is simpler.
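A sketch of the tightened mapper, with illustrative exception classes standing in for the connector's PEP 249 hierarchy and a one-entry stand-in for _CODE_TO_EXCEPTION (all names assumed from this thread):

```python
class Error(Exception): ...
class DatabaseError(Error): ...
class OperationalError(DatabaseError): ...

class KernelError(Exception):
    """Fake of the kernel's structured error for illustration."""
    def __init__(self, message, code="Unknown", sql_state=None):
        super().__init__(message)
        self.code = code
        self.sql_state = sql_state

_CODE_TO_EXCEPTION = {"Unavailable": OperationalError}  # illustrative subset

def reraise_kernel_error(exc: KernelError) -> Error:
    """Tightened contract: only ever called with a KernelError, always
    returns an Error for the caller to raise. No passthrough branch,
    no type: ignore needed."""
    new = _CODE_TO_EXCEPTION.get(exc.code, DatabaseError)(str(exc))
    for attr in ("code", "sql_state"):
        # Plain setattr: PEP 249 exception classes don't use __slots__.
        setattr(new, attr, getattr(exc, attr, None))
    return new

try:
    raise reraise_kernel_error(KernelError("gone", code="Unavailable", sql_state="08000"))
except OperationalError as mapped:
    assert mapped.sql_state == "08000"
```

Unknown codes fall through to DatabaseError, which is why the thread flags a typo in the mapping table ("Unauthenticated") as a silent collapse rather than an error.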


Fixed in 37fa544. Tightened the signature to (exc: _kernel.KernelError) -> Error, dropped the unreachable passthrough branch and the # type: ignore[return-value], and replaced the defensive setattr try/except with a plain setattr(new, attr, getattr(exc, attr, None)) since none of the PEP 249 exception classes use __slots__.

buffer_size_bytes=cursor.buffer_size_bytes,
)

# ── Metadata ───────────────────────────────────────────────────

Medium severity — flagged by maintainability, language

_use_arrow_native_complex_types: Optional[bool] = True is accepted in __init__ but never read. Not passed by session.py::_create_backend either. Drop entirely.

Also: Optional[bool] = True is Union[bool, None] defaulted to True — the Optional is misleading here, since None carries no distinct "use default" meaning alongside True. If the parameter is kept at all, restrict it to bool = True.


Fixed in 37fa544. Removed _use_arrow_native_complex_types from KernelDatabricksClient.__init__ — it was accepted but never read, and session.py::_create_backend doesn't pass it for the kernel branch.

batch size; ``fetchall`` drains the whole stream.
"""

from __future__ import annotations

Medium severity — flagged by test, language

The class docstring claims the duck-typed kernel_handle must implement arrow_schema() / fetch_next_batch() / fetch_all_arrow() / close(). The production code never calls fetch_all_arrowKernelResultSet streams via fetch_next_batch() + a custom _drain. The _FakeKernelHandle test double also doesn't implement fetch_all_arrow.

If the kernel adds a required method (e.g., fetch_next_batch(timeout=...) becomes mandatory), unit tests still pass and the regression lands in e2e — which is silently skipped per F8.

Recommend:

  • Drop fetch_all_arrow from the docstring contract, OR
  • Add a contract test that (when databricks_sql_kernel is importable) asserts the duck-typed methods are actually exposed.


Fixed in 37fa544. Dropped fetch_all_arrow from the kernel-handle contract in the class docstring — the documented duck-typed surface is now arrow_schema() / fetch_next_batch() / close(), matching what the production streaming path and the _FakeKernelHandle test double actually implement.

arraysize: int,
buffer_size_bytes: int,
):
schema = kernel_handle.arrow_schema()

Low severity — flagged by performance

buffer_size_bytes is accepted by the constructor and forwarded to the base ResultSet, but never consulted by the kernel backend. The kernel currently caps buffer by rows-pulled, not bytes.

Recommend: document the no-op (a comment in the class docstring or constructor) so callers tuning buffer_size_bytes for memory ceilings on Thrift know it doesn't apply on use_kernel=True.


Fixed in 37fa544. Added a paragraph to the KernelResultSet class docstring explicitly documenting that buffer_size_bytes is accepted for base-class contract compatibility but is not consulted — kernel currently caps by rows pulled, not bytes.

lz4_compressed=False,
arrow_schema_bytes=None,
)
self._kernel_handle = kernel_handle

Low severity — flagged by maintainability

The base ResultSet expects a results_queue with a next_n_rows / remaining_rows / close interface — both ThriftResultSet and SeaResultSet use it. KernelResultSet passes results_queue=None and duplicates _pull_one_batch / _ensure_buffered / _take_buffered / _drain (~80 lines) plus its own fetch overrides.

The docstring even acknowledges the duplication: "Buffer shape mirrors the prior ADBC POC's AdbcResultSet."

Recommend (follow-up): extract BufferedArrowQueue(results_queue) wrapping any handle implementing arrow_schema() / fetch_next_batch() / close(). Both KernelResultSet and any future AdbcResultSet become 15-line constructor-only subclasses, and the base ResultSet.fetchmany_arrow / fetchall_arrow works unchanged.


Deferred — this is a cross-cutting refactor that would touch both kernel and SEA backends. Tracked as a follow-up; appropriate to land alongside the ADBC POC if/when it gets revived.

buffer_size_bytes=cursor.buffer_size_bytes,
)

def _synthetic_command_id(self) -> CommandId:

Low severity — flagged by security

self._auth_kwargs = {"auth_type": "pat", "access_token": <raw_token>} is stored on the KernelDatabricksClient instance for its life. Thrift / native-SEA materialize the token only inside per-request add_headers calls. The kernel client elevates the cleartext token to a long-lived attribute on a connector object — at risk of accidental pickling, debugger dumps, or telemetry capture.

Recommend: clear self._auth_kwargs (or just self._auth_kwargs["access_token"]) immediately after _kernel.Session(...) returns in open_session. Or move it into a closure rather than an instance attribute.


Fixed in 37fa544. Added a finally block to open_session that pops access_token from self._auth_kwargs after the kernel Session is constructed (or failed). Kernel owns the credential from then on; no cleartext copy stays on the long-lived connector object.
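The finally-pop pattern described in this fix can be sketched as follows (a stand-in _make_session callable replaces _kernel.Session; names are illustrative):

```python
class SessionSketch:
    """Credential hygiene from the fix: pop the cleartext token from the
    long-lived kwargs dict whether session construction succeeds or raises."""
    def __init__(self, auth_kwargs, make_session):
        self._auth_kwargs = dict(auth_kwargs)
        self._make_session = make_session   # stand-in for _kernel.Session

    def open_session(self):
        try:
            self._session = self._make_session(**self._auth_kwargs)
        finally:
            # Kernel owns the credential now; drop the connector-side copy.
            self._auth_kwargs.pop("access_token", None)

# Success path: token is gone after open_session.
ok = SessionSketch({"auth_type": "pat", "access_token": "dapi-secret"},
                   make_session=lambda **kw: object())
ok.open_session()
assert "access_token" not in ok._auth_kwargs

# Failure path: the finally block still scrubs the token.
def _boom(**kwargs):
    raise RuntimeError("construction failed")

failing = SessionSketch({"auth_type": "pat", "access_token": "dapi-secret"},
                        make_session=_boom)
try:
    failing.open_session()
except RuntimeError:
    pass
assert "access_token" not in failing._auth_kwargs
```

The failure path matters: without the finally, a construction error would leave the cleartext token sitting on the connector object indefinitely.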

kernel side."""
with conn.cursor() as cur:
cur.execute("SELECT * FROM range(10000)")
rows = cur.fetchall()

Low severity — flagged by test

Docstring says SELECT * FROM range(10000) "exercises the CloudFetch / multi-batch path on the kernel side". 10000 BIGINT rows is ~80 KB — almost certainly a single inline chunk on a typical warehouse. Existing CloudFetch-aimed tests (tests/e2e/test_driver.py:145-180) use 100 MB / 12.5M rows.

The comment overstates scope. A future reader may believe CloudFetch is covered when it isn't.

Recommend: drop the misleading claim, or scale to range(2_000_000) to actually cross a chunk boundary.


Fixed in 37fa544. Replaced the misleading 'exercises CloudFetch / multi-batch path' claim with a note that it covers end-of-stream drain over multiple fetch_next_batch calls and isn't large enough for CloudFetch — pointing to test_driver for CloudFetch coverage.

Comment thread tests/unit/test_kernel_auth_bridge.py Outdated

from __future__ import annotations

from unittest.mock import MagicMock

Low severity — flagged by test

from unittest.mock import MagicMock is imported but never used in this file. Dead import.


Fixed in 37fa544. Replaced MagicMock import with Mock since the federation test now needs the latter; no longer unused.

@vikrantpuppala

Code Review Squad — Failed Inline Comments

Could not post inline comments for: F2, F4, F7, F10, F11, F13, F14, F19, F22, F25 — see body below.


F2 — Federated-PAT token refresh is dead — single snapshot at construct time

High severity — flagged by security + devils-advocate

kernel_auth_kwargs extracts the bearer token once via auth_provider.add_headers({}) at construct time. The result is stored on KernelDatabricksClient._auth_kwargs and spread into _kernel.Session(...) at open_session. After that, the kernel uses that frozen token for the session lifetime.

get_python_sql_connector_auth_provider wraps every provider in TokenFederationProvider. For a PAT whose issuer host differs from the workspace host, TokenFederationProvider._get_token (src/databricks/sql/auth/token_federation.py:131-153) performs an OAuth token-exchange and caches the exchanged token, refreshing it on subsequent add_headers calls when it expires.

The bridge captures only the first exchanged token and never re-extracts. Long-running kernel sessions outlive the exchanged token's TTL and start failing Unauthenticated mid-session — even though the connector-side TokenFederationProvider would happily mint a fresh one.

The bridge's own docstring acknowledges this failure mode while justifying OAuth rejection: "routing OAuth through PAT would silently break token refresh during long-running sessions." That exact failure mode applies to the federated-PAT case the bridge accepts.

Recommend: either (a) propagate a refresh callback into the kernel Session, (b) block federation-wrapped PAT until the kernel exposes a refresh hook, or (c) refresh _auth_kwargs immediately before each kernel call (defeats the purpose of caching but is correct).
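Option (c) — re-extracting the token before each kernel call rather than snapshotting once — can be sketched like this (fake_add_headers stands in for the provider's real add_headers; everything here is illustrative):

```python
class RefreshingTokenSource:
    """Sketch of F2 option (c): pull a fresh bearer token from the auth
    provider's add_headers on every call, so a federation-exchanged token
    that was refreshed connector-side is what reaches the kernel."""
    def __init__(self, add_headers):
        self._add_headers = add_headers   # provider's add_headers(dict)

    def current_token(self):
        headers = {}
        self._add_headers(headers)        # provider may refresh internally
        return headers["Authorization"][len("Bearer "):]

# Fake provider that mints a new token per call, mimicking a refresh.
calls = []
def fake_add_headers(headers):
    calls.append(1)
    headers["Authorization"] = f"Bearer tok-{len(calls)}"

src = RefreshingTokenSource(fake_add_headers)
assert src.current_token() == "tok-1"
assert src.current_token() == "tok-2"   # fresh extraction, not a snapshot
```

As the finding notes, this trades away caching for correctness; a kernel-side refresh callback (option a) would avoid the per-call extraction entirely.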


F4 — get_tables with table_types silently returns wrong-answer result

High severity — flagged by devils-advocate, agent-compat, ops, security

The warning says "client-side table_types filter not yet implemented … returning unfiltered rows for %r" — and then the code passes table_types=table_types to _kernel_session.metadata().list_tables(...) anyway on the next line.

Either the kernel filters server-side (in which case the warning is wrong and misleading — callers think they're getting unfiltered rows when they aren't) or it doesn't (the kwarg passthrough is dead). Both states are bad:

  • BI tools / schema browsers call getTables(table_types=["TABLE", "VIEW"]) and iterate. If the result is unfiltered, they silently see other table types mixed in — producing wrong UI / wrong downstream answers.
  • Access-boundary callers using table_types as a filter get either a silent wrong answer or an undocumented divergence from the connector contract.

logger.warning is also invisible to programmatic consumers — it's not actionable.

Recommend: until the filter is correctly implemented, raise NotSupportedError when table_types is non-empty. Same pattern as parameters and query_tags rejection elsewhere in this client. A logger.warning plus silent wrong-answer is the worst of both worlds.
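The recommended guard is small; a sketch with an illustrative NotSupportedError stand-in (the real connector class lives in its PEP 249 hierarchy):

```python
class NotSupportedError(Exception): ...

def check_table_types(table_types):
    """Sketch of the F4 recommendation: fail loudly while the table_types
    filter is unimplemented, instead of warning and returning unfiltered
    rows that callers will treat as filtered."""
    if table_types:
        raise NotSupportedError(
            "table_types filtering is not yet supported on the kernel backend"
        )

check_table_types(None)   # no filter requested: fine
check_table_types([])     # empty filter: fine
try:
    check_table_types(["TABLE", "VIEW"])
except NotSupportedError:
    pass
else:
    raise AssertionError("non-empty table_types should raise")
```

A raised exception is visible to programmatic consumers (BI tools, agents) in a way a logger.warning never is — the same reasoning behind the client's existing parameters / query_tags rejection.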


F7 — get_tables warning is log spam at metadata-call frequency

High severity — flagged by ops, agent-compat, architecture

logger.warning fires on every get_tables call with a non-empty table_types. Metadata calls are made by every BI tool that browses a workspace: Power BI, Tableau, dbt, DataGrip, etc. — all call getTables(table_types=["TABLE", "VIEW"]) on connection open and every catalog-tree refresh. At Databricks customer scale this is easily 100s of QPS per workspace.

WARN-level log spam at that rate triggers alert fatigue for any operator with a "WARN-or-above" alarm and dominates the kernel-backend portion of stdout/stderr at scale.

Recommend: demote to logger.debug, or move to a once-per-session lazy-init flag (operator sees it once during the session, not on every call). Even logger.info is borderline at metadata-call frequency.
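The once-per-session variant can be sketched with a tiny flag wrapper (illustrative names; a list-backed fake logger stands in for the logging module so the behavior is observable):

```python
class WarnOnce:
    """Sketch of the F7 once-per-session option: WARN on the first
    unsupported-filter call, DEBUG thereafter."""
    def __init__(self, logger):
        self._logger = logger
        self._warned = False

    def note_unfiltered(self, table_types):
        if not self._warned:
            self._logger.warning("table_types filter not implemented: %r", table_types)
            self._warned = True
        else:
            self._logger.debug("table_types filter not implemented: %r", table_types)

class _ListLogger:
    """Fake logger capturing messages per level for the demo."""
    def __init__(self):
        self.warnings = []
        self.debugs = []
    def warning(self, msg, *args):
        self.warnings.append(msg % args)
    def debug(self, msg, *args):
        self.debugs.append(msg % args)

log = _ListLogger()
w = WarnOnce(log)
for _ in range(3):
    w.note_unfiltered(["TABLE", "VIEW"])
assert len(log.warnings) == 1   # one WARN per session
assert len(log.debugs) == 2    # repeats demoted to DEBUG
```

At metadata-call frequency this keeps the operator signal (one WARN per session) without the alert-fatigue cost of WARN-per-call.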


F10 — Telemetry is structurally degraded for kernel-backed cursors

Medium severity — flagged by ops, architecture

Three places where the per-statement / per-chunk telemetry that downstream dashboards depend on is silently broken for use_kernel=True:

  1. _extract_cursor_data:54 reads cursor.active_result_set.results and isinstance-checks against ColumnQueue / CloudFetchQueue / ArrowQueue. KernelResultSet.__init__ passes results_queue=Noneself.results is None on every kernel result set. Every kernel-backed query will telemetry-report ExecutionResultFormat.FORMAT_UNSPECIFIED — a silent dashboard-level regression for anyone slicing latency by execution format.
  2. _extract_cursor_data:68 reads cursor.backend.retry_policy for retry_count. KernelDatabricksClient has no retry_policy attribute — kernel manages retries internally. Every kernel-backed cursor will report retry_count=0. Retry-count anomalies are a leading signal for partial outages.
  3. CloudFetch chunk-level telemetry (@log_latency on ResultSetDownloadHandler) only fires when the connector's own CloudFetch downloader runs. The kernel does its own HTTP fetching of cloud URLs — so chunk-level latency / retry / decompression telemetry is entirely absent.

Plus: connection-level DriverConnectionParameters still reports DatabricksClientType.SEA for kernel sessions (src/databricks/sql/client.py:379-381).

Recommend: either (a) plumb a telemetry hook from the kernel through KernelResultSet (kernel exposes chunk-fetch / retry-count summaries that the connector forwards into SqlExecutionEvent), or (b) document this as a known-limitation banner in the PR description and acknowledge dashboards keyed on these fields under-report for use_kernel=True.


F11 — cancel_command is a no-op on the sync-execute path; get_query_state reports SUCCEEDED

Medium severity — flagged by ops, devils-advocate, agent-compat

Sync executes never populate _async_handles. So calling cancel_command for a still-draining sync cursor logs at DEBUG and returns. Combined with get_query_state (client.py:543-549) returning CommandState.SUCCEEDED for unknown command_ids, a cursor.cancel() followed by get_query_state() falsely reports success — the query continues running server-side.

The inline comment claims this matches Thrift's tolerant behavior — but Thrift's tolerance is for race-y double-cancels, not "cancel was called for a command we never tracked."

Recommend: either (a) for sync-execute results, store the executed handle in a separate _sync_handles map keyed on command_id so cancel_command can call handle.cancel() on it, or (b) explicitly document that cursor.cancel() after execute() returned (i.e., during result drain) is a no-op on use_kernel=True. Today it's both silently broken and not documented.


F13 — get_columns(catalog_name=None) raises ProgrammingError; Thrift accepts None — silent semantic drift

Medium severity — flagged by devils-advocate, agent-compat, security, architecture

KernelDatabricksClient.get_columns raises ProgrammingError("get_columns requires catalog_name on the kernel backend.") when catalog_name is None. The Thrift backend (src/databricks/sql/backend/thrift_backend.py:1217-1244) accepts catalog_name=None and forwards it to the server. Cursor.columns() (src/databricks/sql/client.py:1567-1593) keeps catalog_name: Optional[str] = None in its public signature.

Cross-backend code (third-party tooling doing schema discovery) that worked on the Thrift backend will start raising on the same call under use_kernel=True. The error message is actionable, but the divergence isn't documented in the Cursor.columns() docstring.

Recommend: either (a) narrow the Cursor.columns() signature for kernel mode, (b) fall back to a per-catalog scan inside the kernel client, or (c) document the divergence on Cursor.columns() so users know to branch on backend.


F14 — Synthetic metadata-{uuid} CommandIds pollute telemetry and cursor.query_id

Medium severity — flagged by architecture, maintainability, devils-advocate

_synthetic_command_id returns CommandId.from_sea_statement_id(f"metadata-{uuid.uuid4()}") — i.e., a CommandId(backend_type=SEA, guid="metadata-..."). Followed the consumers:

  • Cursor.query_id returns self.active_command_id.to_hex_guid() which for SEA-typed CommandId returns the literal string "metadata-9f3d…".
  • Telemetry log builders (src/databricks/sql/telemetry/) record query_id verbatim. Synthetic IDs pollute the telemetry stream.
  • Anything downstream that does int(statement_id, 16) or uuid.UUID(statement_id) would choke. No such consumer in src/ today, but downstream observability pipelines that ingest query_id may.

The synthetic ID also misrepresents the backend (BackendType.SEA when this is the kernel).

Recommend: either (a) make the synthetic ID look like a UUID (uuid.uuid4().hex with no metadata- prefix — lose human-readability for telemetry safety), (b) add a CommandId.for_kernel_metadata() factory in backend/types.py so the synthetic-ness is explicit, or (c) don't set active_command_id for metadata calls (let query_id stay None as it does for any cursor before execute).
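Option (a) is a one-liner; a sketch with an illustrative helper name:

```python
import uuid

def synthetic_statement_id() -> str:
    """Option (a) from F14: a UUID-shaped synthetic statement id with no
    "metadata-" prefix, so downstream consumers doing uuid.UUID(query_id)
    parse it cleanly and telemetry ingestion isn't polluted."""
    return uuid.uuid4().hex

sid = synthetic_statement_id()
assert uuid.UUID(sid)                    # parses as a UUID
assert not sid.startswith("metadata-")   # no human-readable prefix leaks out
```

The trade-off the finding names is real: this drops the human-readable hint that the id is synthetic, which is why option (b)'s explicit CommandId.for_kernel_metadata() factory may be the better long-term shape.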


F19 — Three near-identical return KernelResultSet(...) blocks should call _metadata_result

Medium severity — flagged by maintainability

client.py:510-517 (sync execute), client.py:577-584 (get_execution_result), and client.py:588-596 (metadata) all build KernelResultSet with the same 6 args. The metadata path is already factored out as _metadata_result(stream, cursor, command_id). The other two should call it too — saves 12 lines and removes a divergence risk if a new kwarg is added to KernelResultSet.


F22 — auth_bridge._is_pat duplicates TokenFederationProvider-peek logic from telemetry

Low severity — flagged by maintainability

_is_pat peeks through TokenFederationProvider to find a wrapped AccessTokenAuthProvider. The same pattern lives at src/databricks/sql/telemetry/telemetry_client.py:94-101:

if isinstance(auth_provider, TokenFederationProvider):
    return TelemetryHelper.get_auth_mechanism(auth_provider.external_provider)
if isinstance(auth_provider, AccessTokenAuthProvider):
    return AuthMech.PAT

Recent commit cbd6a883 ("Telemetry: unwrap TokenFederationProvider…") shows the team explicitly cares about this surface.

Recommend: extract auth/utils.py::unwrap_to_pat_provider(auth_provider) -> Optional[AccessTokenAuthProvider] shared by both call sites. Keeps them in lock-step when OAuth-through-kernel ships. Not a blocker.


F25 — No control-char / CRLF sanitization on extracted bearer token

Low severity — flagged by security

_extract_bearer_token returns auth[len("Bearer "):] verbatim. AccessTokenAuthProvider.__init__ doesn't validate either. User-supplied access_token= strings containing \r\n, \0, or other control chars flow straight through to the kernel. If the kernel layer ever places this back into an HTTP header without scrubbing, it's a header-injection sink.

Recommend: defense-in-depth — reject tokens matching [\x00-\x1f\x7f] at the connector boundary. Cheap.


Summary: 17 of 27 findings posted as inline comments. 10 failed due to line-number validation (not in diff context). All critical/high/medium findings are covered — either by inline comment or in this summary.

Cleanup pass on the kernel-backend PR addressing reviewer feedback
that doesn't change observable behaviour:

- result_set.py: replace O(M²) `_buffered_rows` with running counter
  `_buffered_count` maintained by pull/take/drain (perf F6).
- result_set.py: docstring corrections — drop nonexistent
  `fetch_all_arrow` from kernel-handle contract (F20); document
  `buffer_size_bytes` as no-op on the kernel backend (F21).
- client.py: tighten `_reraise_kernel_error` signature to
  `_kernel.KernelError` only; drop dead passthrough branch and the
  defensive setattr try/except (F17).
- client.py: drop unused `_use_arrow_native_complex_types` kwarg (F18).
- client.py: collapse three `KernelResultSet(...)` construction sites
  through `_make_result_set` (renamed from `_metadata_result`) (F19).
- client.py: drop `metadata-` prefix from synthetic CommandId; use a
  plain `uuid.uuid4().hex` so anything reading `cursor.query_id`
  downstream sees a UUID-shaped string (F14).
- client.py: clear the raw access token from `_auth_kwargs` after the
  kernel session is constructed — kernel owns the credential from
  then on, no need to retain a cleartext copy on the connector
  instance (F24).
- auth_bridge.py: reject bearer tokens containing ASCII control
  characters at extraction time (defense-in-depth against header
  injection if a misbehaving HTTP stack ever places the token back
  into a header without scrubbing) (F25).
- tests/unit/test_kernel_auth_bridge.py: construct a real
  `TokenFederationProvider(http_client=Mock())` instead of bypassing
  `__init__` with `__new__` + monkey-patching `add_headers`. Exercises
  the real federation passthrough path the bridge sees in production
  (F12). Drop unused `MagicMock` import (F27).
- tests/e2e/test_kernel_backend.py: drop misleading CloudFetch claim
  on `test_drain_large_range_to_arrow` — 10000 BIGINT rows is ~80 KB,
  single inline chunk on a typical warehouse (F26).

All 39 existing kernel unit tests pass.

Co-authored-by: Isaac
…ve review fixes

Major change: route the kernel backend through a new ``use_kernel=True``
connection kwarg instead of repurposing ``use_sea=True``. ``use_sea=True``
once again routes to the native pure-Python SEA backend (no behaviour
change); ``use_kernel=True`` routes to the Rust kernel via PyO3. The
two flags are mutually exclusive.

This addresses the largest reviewer concern from the multi-agent
review: silently hijacking a documented public flag broke OAuth /
federation / parameter-binding callers on ``use_sea=True`` who had no
opt-out. With the new flag, the kernel backend is fully opt-in and
existing ``use_sea=True`` users continue to get the native SEA backend
they signed up for.

Other substantive fixes:

- session.py: restore ``SeaDatabricksClient`` import + routing. Reject
  ``use_kernel=True`` + ``use_sea=True`` together with a clear
  ``ValueError``.
- client.py (kernel ``Cursor.columns``): update docstring to flag the
  ``catalog_name=None`` divergence — kernel requires a catalog,
  Thrift / native SEA do not (F13).
- conftest.py: drop the collection-time ``pytest_collection_modifyitems``
  hook that was skipping ``extra_params={"use_sea": True}`` cases. With
  ``use_sea=True`` back on the native SEA backend, those cases run as
  they did before this PR (F8).
- kernel/client.py: ``get_tables`` now applies the ``table_types``
  filter client-side using ``ResultSetFilter._filter_arrow_table``
  (the same helper the native SEA backend uses), wrapped in a tiny
  ``_StaticArrowHandle`` that flows the filtered table back through
  the normal ``KernelResultSet`` path. Replaces the previous
  "log a warning and return unfiltered" behaviour (F4).
- kernel/client.py: guard ``_async_handles`` with ``threading.RLock``
  so concurrent cursors on the same connection don't race on
  submit / close / close-session (F15).
- kernel/result_set.py: ``KernelResultSet.close()`` now drops the
  entry from ``backend._async_handles`` so async-submitted statements
  don't leave stale references behind (F5).
- kernel/{__init__,client,auth_bridge}.py, tests/e2e/test_kernel_backend.py:
  update docstrings, error messages, and the e2e fixture to refer to
  ``use_kernel=True`` instead of ``use_sea=True``.
- client.py (``Connection`` docstring): document the new
  ``use_kernel`` kwarg + its Phase-1 limitations.

New tests:

- tests/unit/test_kernel_client.py (38 cases): cover the 14-entry
  ``_CODE_TO_EXCEPTION`` table, ``_reraise_kernel_error`` attribute
  forwarding, the 6-entry ``_STATE_TO_COMMAND_STATE`` table, the
  no-open-session guards on every method, ``open_session`` double-open,
  ``parameters`` / ``query_tags`` rejection, ``get_columns``'
  catalog-required check, ``cancel_command`` / ``close_command``
  no-handle tolerance, ``get_query_state`` sync-path SUCCEEDED, the
  Failed-state re-raise, the synthetic-command-id UUID shape, and
  ``close_session`` cleanup even when per-handle close errors fire.
  Uses a fake ``databricks_sql_kernel`` module installed into
  ``sys.modules`` so the test runs with no Rust extension dependency
  (F9).

77/77 kernel unit tests pass.

Co-authored-by: Isaac
@vikrantpuppala vikrantpuppala changed the title feat(backend/kernel): route use_sea=True through the Rust kernel via PyO3 feat(backend/kernel): add use_kernel=True flag — route through the Rust kernel via PyO3 May 15, 2026
@vikrantpuppala
Contributor Author

Code-review responses — summary-only findings

Replies to the findings that landed in the summary comment (couldn't be posted as inline replies because the cited line was outside the diff hunk):

  • F2 — Federated-PAT token refresh dead. Acknowledged as a known gap in the updated PR description. The bridge currently snapshots the bearer at construct time; refresh has to come from the kernel side (no Python-side hook today). Switching to use_kernel=True (24e9a5c) makes this opt-in, so existing use_sea=True callers are unaffected.

  • F4 — table_types silent wrong-answer. Fixed in 24e9a5c. get_tables now applies a client-side filter using the SEA backend's ResultSetFilter._filter_arrow_table (column index 5 = TABLE_TYPE, case-sensitive), wrapped in a tiny _StaticArrowHandle that flows the filtered table back through the normal KernelResultSet path. Same semantics as the native SEA backend.

  • F7 — get_tables warning is log spam. Resolved as a side-effect of F4 — the warning is gone since the filter is now correctly applied.

  • F10 — Telemetry parity. Acknowledged as a known gap in the updated PR description. Per-statement execution_result / retry_count / chunk-level latency under-report for use_kernel=True. Will plumb kernel-side hooks as they appear.

  • F11 — cancel_command sync no-op. Documented; tracking as a follow-up. The kernel currently doesn't return cancelable handles from sync execute(); once it does, we'll track them in a _sync_handles map.

  • F13 — get_columns(catalog_name=None) divergence. Fixed in 24e9a5c — Cursor.columns() docstring now documents that catalog_name is required on use_kernel=True.

  • F14 — Synthetic metadata-{uuid} CommandId. Fixed in 37fa544 — now uses uuid.uuid4().hex (no prefix), so cursor.query_id stays parseable downstream.

  • F19 — Three redundant KernelResultSet(...) blocks. Fixed in 37fa544 — all three sites now route through _make_result_set (renamed from _metadata_result).

  • F22 — Auth-bridge / telemetry duplication. Deferred as a cross-cutting refactor — extracting an auth/utils.py::unwrap_to_pat_provider shared by both call sites would be its own follow-up.

  • F25 — Bearer token control-char sanitization. Fixed in 37fa544. _extract_bearer_token now rejects tokens matching [\x00-\x1f\x7f] with a clear ValueError at extraction time.
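The F4 filter semantics (column index 5 = TABLE_TYPE, exact case-sensitive match, empty filter passes everything through) can be illustrated with a pure-Python sketch. The actual fix filters a pyarrow table via ResultSetFilter._filter_arrow_table; this row-list version only mirrors the semantics:

```python
from typing import List, Optional, Sequence, Tuple

TABLE_TYPE_COLUMN = 5  # TABLE_TYPE position in the get_tables row schema

def filter_table_rows(
    rows: List[Tuple], table_types: Optional[Sequence[str]]
) -> List[Tuple]:
    """Client-side table_types filter with F4's semantics.

    Pure-Python illustration only; the real code filters the Arrow
    table with the same helper the native SEA backend uses.
    """
    if not table_types:          # no filter -> return everything
        return list(rows)
    allowed = set(table_types)   # exact, case-sensitive membership test
    return [row for row in rows if row[TABLE_TYPE_COLUMN] in allowed]
```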

77/77 kernel unit tests passing locally on the new branch tip 24e9a5c2.
